Diamonds.Prices2022 = read.csv("/Users/noamgonen/Desktop/pstat 126/Diamonds_Prices2022.csv", header = TRUE)
head(Diamonds.Prices2022)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Above I have printed the first 6 observations in the data. To better understand the data, I have described each variable; here is what I have gathered:
Carat: the unit of weight of the diamond (1 carat = 200 milligrams)
Depth: the length from the bottom tip (culet) to the flat top surface, divided by the width of the thickest part of the diamond (in percent)
Table: the length of the top flat part of the diamond divided by the width of the thickest part of the diamond (in percent)
Price: how much the diamond costs, in dollars
x: length of the diamond in millimeters
y: width of the diamond in millimeters
z: height of the diamond in millimeters
Cut: calculated from the depth, table, girdle thickness, pavilion angle, and crown angle of the diamond; 5 possible categories: Fair, Good, Very Good, Premium, Ideal.
Color: 7 possible categories: D (the highest color grade), E and F (colorless grades that are highly desirable), G (a near-colorless grade that appears colorless), H (a near-colorless grade), I and J (nearly colorless).
Clarity: a measure of the purity and rarity of the stone, graded by the visibility of inclusions under 10-power magnification; 8 possible categories: IF (internally flawless), VVS1 and VVS2 (very, very slightly included), VS1 and VS2 (very slightly included), SI1 and SI2 (slightly included), I1 (included).
Summary of the data statistics:
library(skimr)
skim(Diamonds.Prices2022)
| Name | Diamonds.Prices2022 |
| Number of rows | 53943 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| cut | 0 | 1 | 4 | 9 | 0 | 5 | 0 |
| color | 0 | 1 | 1 | 1 | 0 | 7 | 0 |
| clarity | 0 | 1 | 2 | 4 | 0 | 8 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1 | 26972.00 | 15572.15 | 1.0 | 13486.50 | 26972.00 | 40457.50 | 53943.00 | ▇▇▇▇▇ |
| carat | 0 | 1 | 0.80 | 0.47 | 0.2 | 0.40 | 0.70 | 1.04 | 5.01 | ▇▂▁▁▁ |
| depth | 0 | 1 | 61.75 | 1.43 | 43.0 | 61.00 | 61.80 | 62.50 | 79.00 | ▁▁▇▁▁ |
| table | 0 | 1 | 57.46 | 2.23 | 43.0 | 56.00 | 57.00 | 59.00 | 95.00 | ▁▇▁▁▁ |
| price | 0 | 1 | 3932.73 | 3989.34 | 326.0 | 950.00 | 2401.00 | 5324.00 | 18823.00 | ▇▂▁▁▁ |
| x | 0 | 1 | 5.73 | 1.12 | 0.0 | 4.71 | 5.70 | 6.54 | 10.74 | ▁▁▇▃▁ |
| y | 0 | 1 | 5.73 | 1.14 | 0.0 | 4.72 | 5.71 | 6.54 | 58.90 | ▇▁▁▁▁ |
| z | 0 | 1 | 3.54 | 0.71 | 0.0 | 2.91 | 3.53 | 4.04 | 31.80 | ▇▁▁▁▁ |
Summary of the new data after selecting 500 random observations:
library(dplyr)  # needed for %>%, sample_n(), and select()
set.seed(123)
DiamondData <- Diamonds.Prices2022 %>% sample_n(500) %>% unique() %>% select(-X)
attach(DiamondData)
skim(DiamondData)
| Name | DiamondData |
| Number of rows | 500 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| cut | 0 | 1 | 4 | 9 | 0 | 5 | 0 |
| color | 0 | 1 | 1 | 1 | 0 | 7 | 0 |
| clarity | 0 | 1 | 2 | 4 | 0 | 8 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| carat | 0 | 1 | 0.80 | 0.46 | 0.23 | 0.38 | 0.71 | 1.05 | 2.30 | ▇▆▂▂▁ |
| depth | 0 | 1 | 61.83 | 1.36 | 57.80 | 61.00 | 61.90 | 62.50 | 67.70 | ▁▆▇▁▁ |
| table | 0 | 1 | 57.38 | 2.21 | 52.00 | 56.00 | 57.00 | 59.00 | 66.00 | ▁▇▅▁▁ |
| price | 0 | 1 | 3922.04 | 3970.49 | 368.00 | 956.75 | 2537.00 | 5283.75 | 18118.00 | ▇▃▁▁▁ |
| x | 0 | 1 | 5.73 | 1.12 | 3.93 | 4.65 | 5.73 | 6.56 | 8.43 | ▇▅▇▅▂ |
| y | 0 | 1 | 5.73 | 1.11 | 3.96 | 4.64 | 5.72 | 6.55 | 8.46 | ▇▅▇▃▂ |
| z | 0 | 1 | 3.54 | 0.69 | 2.38 | 2.85 | 3.53 | 4.03 | 5.28 | ▇▆▇▃▂ |
Here we can see that the sample has 500 rows (observations) and 10 columns (variables): the response, price, plus 9 variables I can use to build my regression model. (The X column, which only marked the observation number, was dropped when sampling.) Three of the variables are character types and seven are numeric. The three character variables are cut, clarity, and color; none have missing values, and cut has 5 unique categories, clarity has 8, and color has 7. For the numeric variables we can see their means, standard deviations, quantiles, and a small look at their histograms, which I summarize as follows:
Carat: no missing observations, mean: 0.80, sd: 0.46, median: 0.71, range: (0.23, 2.30); the distribution is right-skewed, with some large outliers.
Price: no missing observations, mean: 3922.04, sd: 3970.49, median: 2537.00, range: (368, 18118); the distribution is right-skewed.
Depth: no missing observations, mean: 61.83, sd: 1.36, median: 61.90, range: (57.80, 67.70); the distribution shows both small and large outliers, but the data is centered in the middle.
Table: no missing observations, mean: 57.38, sd: 2.21, median: 57.00, range: (52.00, 66.00); the distribution is right-skewed with a large outlier.
X: no missing observations, mean: 5.73, sd: 1.12, median: 5.73, range: (3.93, 8.43); the data is spread out with no extreme outliers.
Y: no missing observations, mean: 5.73, sd: 1.11, median: 5.72, range: (3.96, 8.46); the data is spread out with no extreme outliers.
Z: no missing observations, mean: 3.54, sd: 0.69, median: 3.53, range: (2.38, 5.28); the data is spread out with no extreme outliers.
From the two summaries above, I can see that the distributions of the numerical variables weren't much affected by sampling; the shapes stayed the same. The only noticeable change was in the x, y, and z variables, but that is probably due to the loss of outliers.
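As a side note, `set.seed()` is what makes this comparison repeatable: the same seed reproduces the same random draw every run. A minimal base-R sketch of that behavior (using `sample()` as a stand-in for `dplyr::sample_n()`):

```r
# Reproducibility sketch: the same seed yields the same 500 row indices,
# so the sample summaries above can be regenerated exactly.
set.seed(123)
s1 <- sample(1:53943, 500)  # 53943 rows, as reported by skim() above

set.seed(123)
s2 <- sample(1:53943, 500)  # resetting the seed repeats the draw

identical(s1, s2)  # TRUE
```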
library(ggplot2)  # needed for ggplot() and geom_bar()
ggplot(Diamonds.Prices2022, aes(x=color)) + geom_bar(fill='red') + labs(x='Color')
ggplot(DiamondData, aes(x=color)) + geom_bar(fill='blue') + labs(x='Color')
ggplot(Diamonds.Prices2022, aes(x=cut)) + geom_bar(fill='red') + labs(x='Cut')
ggplot(DiamondData, aes(x=cut)) + geom_bar(fill='blue') + labs(x='Cut')
ggplot(Diamonds.Prices2022, aes(x=clarity)) + geom_bar(fill='red') + labs(x='Clarity')
ggplot(DiamondData, aes(x=clarity)) + geom_bar(fill='blue') + labs(x='Clarity')
The population distributions (red) and the sample distributions (blue) look nearly identical for all three categorical variables. Overall, I would say the sample is representative of the population.
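The visual comparison can also be backed up numerically with `prop.table()`. A self-contained sketch on a hypothetical pair of data frames (`pop` and `samp` stand in for Diamonds.Prices2022 and DiamondData; the counts are made up):

```r
# Compare category shares between a population and a sample; large gaps
# would indicate the sample is not representative.
pop  <- data.frame(cut = rep(c("Ideal", "Premium", "Good"), times = c(6, 3, 1)))
samp <- data.frame(cut = rep(c("Ideal", "Premium", "Good"), times = c(3, 1, 1)))

pop_share  <- prop.table(table(pop$cut))   # Good 0.1, Ideal 0.6, Premium 0.3
samp_share <- prop.table(table(samp$cut))  # Good 0.2, Ideal 0.6, Premium 0.2

round(pop_share - samp_share, 2)           # differences near 0 = representative
```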
\(price = \beta_0 + \beta_1 carat + \beta_2 cut + \beta_3 color + \beta_4 clarity + \beta_5 depth + \beta_6 table + \beta_7 x + \beta_8 y + \beta_9 z + \epsilon\)
Summary of Regression Model
lm1 = lm(price ~ carat + table + depth + cut + color + clarity + x + y + z)
summary(lm1)
##
## Call:
## lm(formula = price ~ carat + table + depth + cut + color + clarity +
## x + y + z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4166.1 -559.7 -164.5 383.5 5285.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8797.41 13449.58 -0.654 0.5134
## carat 13777.32 573.02 24.043 < 2e-16 ***
## table -50.43 30.63 -1.646 0.1004
## depth 195.40 211.10 0.926 0.3551
## cutGood 613.22 357.50 1.715 0.0869 .
## cutIdeal 824.32 369.45 2.231 0.0261 *
## cutPremium 655.62 348.33 1.882 0.0604 .
## cutVery Good 885.34 351.54 2.518 0.0121 *
## colorE -196.33 190.13 -1.033 0.3023
## colorF -175.35 187.31 -0.936 0.3497
## colorG -438.40 187.33 -2.340 0.0197 *
## colorH -1038.16 193.84 -5.356 1.33e-07 ***
## colorI -1468.93 230.78 -6.365 4.59e-10 ***
## colorJ -2188.73 265.30 -8.250 1.56e-15 ***
## clarityIF 4604.14 436.94 10.537 < 2e-16 ***
## claritySI1 3463.71 374.63 9.246 < 2e-16 ***
## claritySI2 2473.12 381.44 6.484 2.24e-10 ***
## clarityVS1 4373.50 388.14 11.268 < 2e-16 ***
## clarityVS2 4373.75 384.18 11.385 < 2e-16 ***
## clarityVVS1 4774.12 426.40 11.196 < 2e-16 ***
## clarityVVS2 4765.38 408.61 11.663 < 2e-16 ***
## x -874.30 1552.45 -0.563 0.5736
## y 1803.14 1591.29 1.133 0.2577
## z -4735.42 3392.24 -1.396 0.1634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1078 on 476 degrees of freedom
## Multiple R-squared: 0.9296, Adjusted R-squared: 0.9262
## F-statistic: 273.4 on 23 and 476 DF, p-value: < 2.2e-16
plot(DiamondData)
From the above scatterplot matrix, I can see that the variables x, y, z, and carat have strong positive linear relationships with one another. To check this further, I will calculate the correlation of each pair; a correlation near 1 indicates a nearly perfect positive linear relationship.
library(corrplot)  # needed for corrplot()
DiamondData2 <- subset(DiamondData, select = c("price", "carat", "depth", "table", "x", "y", "z"))
corrplot(cor(DiamondData2), method = "number")
Above we can see the correlations between all the numeric variables. As expected, price, carat, x, y, and z are all very strongly correlated with one another, whereas table and depth are not.
Because carat, x, y, and z are so strongly correlated, only one of them should be used in my regression model; including them all is redundant and causes multicollinearity.
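One common rule of thumb (my assumption here, not part of the report) is to flag predictor pairs whose absolute correlation exceeds about 0.9 and keep only one variable from each flagged pair. A self-contained sketch with synthetic stand-ins for carat and the dimensions:

```r
# Synthetic data: x tracks carat closely (as a physical dimension would),
# while depth is essentially unrelated.
set.seed(1)
carat <- runif(200, 0.2, 2.5)
x     <- 6 * carat^(1/3) + rnorm(200, sd = 0.05)
depth <- rnorm(200, mean = 61.8, sd = 1.4)

cm   <- cor(cbind(carat, x, depth))
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)

# Keep only one variable from each flagged pair in the model
cm[high]  # correlation(s) exceeding the 0.9 threshold
```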
The next step is to conduct significance tests at \(\alpha\) = 0.05 to determine which variables are significantly related to the response variable, price. To do this, I will test the significance of the slope in the simple linear regression model between each numeric predictor and the response, using the hypotheses:
\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)
linear model for price and carat:
lm_carat = lm(price ~ carat)
summary(lm_carat)
##
## Call:
## lm(formula = price ~ carat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5785.6 -850.9 1.8 619.4 7712.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2361.6 139.6 -16.92 <2e-16 ***
## carat 7868.4 151.2 52.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1566 on 498 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 2708 on 1 and 498 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{price} = -2361.6 + 7868.4\,carat\)
The slope's p-value is less than 0.05, so \(H_0\) is rejected, and we conclude that carat has a significant relationship with price. The \(R^2adj\) is 0.844, meaning about 84% of the variance in the response, price, is explained by the model.
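For reference, the slope test that `summary()` reports can be reproduced by hand: the t value is the estimate divided by its standard error, compared against a t distribution with n − 2 degrees of freedom. A sketch on synthetic stand-in data (the coefficients below are made up, not the report's):

```r
# Verify by hand that summary()'s slope test is estimate / std. error.
set.seed(42)
w   <- runif(50, 0.2, 2)                        # stand-in for carat
p   <- -2000 + 8000 * w + rnorm(50, sd = 1500)  # stand-in for price
fit <- lm(p ~ w)

est  <- coef(summary(fit))["w", "Estimate"]
se   <- coef(summary(fit))["w", "Std. Error"]
tval <- est / se
pval <- 2 * pt(-abs(tval), df = fit$df.residual)

# Both match the "t value" and "Pr(>|t|)" columns of summary(fit)
c(t = tval, p = pval)
```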
linear model for price and depth:
lm_depth = lm(price ~ depth)
summary(lm_depth)
##
## Call:
## lm(formula = price ~ depth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3548 -2958 -1376 1358 14224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2365.29 8091.14 0.292 0.770
## depth 25.18 130.83 0.192 0.847
##
## Residual standard error: 3974 on 498 degrees of freedom
## Multiple R-squared: 7.436e-05, Adjusted R-squared: -0.001934
## F-statistic: 0.03704 on 1 and 498 DF, p-value: 0.8475
The fitted model is:
\(\widehat{price} = 2365.29 + 25.18\,depth\)
The slope's p-value is greater than 0.05, so we fail to reject \(H_0\), and we conclude that depth does not have a significant relationship with price. The \(R^2adj\) is -0.0019, which is very poor and something to consider for my later model.
linear model for price and table:
lm_table = lm(price ~ table)
summary(lm_table)
##
## Call:
## lm(formula = price ~ table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5448 -2763 -1427 1299 14969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9375.51 4574.40 -2.050 0.04093 *
## table 231.75 79.66 2.909 0.00379 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3941 on 498 degrees of freedom
## Multiple R-squared: 0.01671, Adjusted R-squared: 0.01474
## F-statistic: 8.463 on 1 and 498 DF, p-value: 0.003787
The fitted model is:
\(\widehat{price} = -9375.51 + 231.75\,table\)
The slope's p-value is less than 0.05, so \(H_0\) is rejected, and we conclude that table has a significant relationship with price. The \(R^2adj\) is 0.0147, which is low, but I don't think including table would hurt my model.
linear model for price and x (length):
lm_x = lm(price ~ x)
summary(lm_x)
##
## Call:
## lm(formula = price ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4662.8 -1283.9 -215.9 971.3 8662.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13912.77 446.85 -31.14 <2e-16 ***
## x 3111.51 76.52 40.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1912 on 498 degrees of freedom
## Multiple R-squared: 0.7685, Adjusted R-squared: 0.7681
## F-statistic: 1654 on 1 and 498 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{price} = -13912.77 + 3111.51\,x\)
The slope's p-value is less than 0.05, so \(H_0\) is rejected, and we conclude that x (length) has a significant relationship with price. The \(R^2adj\) is 0.768, meaning about 76.8% of the variance in price is explained by the model.
linear model for price and y (width):
lm_y = lm(price ~ y)
summary(lm_y)
##
## Call:
## lm(formula = price ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4596.8 -1313.9 -254.2 986.6 8468.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14083.71 442.05 -31.86 <2e-16 ***
## y 3140.49 75.69 41.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1883 on 498 degrees of freedom
## Multiple R-squared: 0.7756, Adjusted R-squared: 0.7752
## F-statistic: 1722 on 1 and 498 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{price} = -14083.71 + 3140.49\,y\)
The slope's p-value is less than 0.05, so \(H_0\) is rejected, and we conclude that y (width) has a significant relationship with price. The \(R^2adj\) is 0.7752, meaning about 77.52% of the variance in price is explained by the model.
linear model for price and z (height):
lm_z = lm(price ~ z)
summary(lm_z)
##
## Call:
## lm(formula = price ~ z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6444.0 -1280.9 -193.0 914.9 8420.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13743.6 452.1 -30.40 <2e-16 ***
## z 4983.7 125.2 39.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1943 on 498 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.7605
## F-statistic: 1585 on 1 and 498 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{price} = -13743.6 + 4983.7\,z\)
The slope's p-value is less than 0.05, so \(H_0\) is rejected, and we conclude that z (height) has a significant relationship with price. The \(R^2adj\) is 0.7605, meaning about 76.05% of the variance in price is explained by the model.
Next, I will construct a 95% confidence interval for the slope of carat in the price ~ carat model.
summary_stats <- summary(lm_carat)
coeffs <- summary_stats$coefficients[, "Estimate"]
standard_error <- summary_stats$coefficients[, "Std. Error"]
t_crit <- qt(1-0.05/2, df = summary_stats$df[2])
lower_bound <- coeffs - t_crit * standard_error
upper_bound <- coeffs + t_crit * standard_error
conf_int <- cbind(lower_bound, upper_bound)
conf_int
## lower_bound upper_bound
## (Intercept) -2635.935 -2087.360
## carat 7571.296 8165.467
Based on the results of the above code, I can say I am 95% confident that the true effect of carat on price is between 7571.296 and 8165.467 dollars per carat.
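The manual calculation above can be checked against `confint()`, which computes the same estimate ± t-quantile × standard-error interval internally. A sketch on synthetic data (since the report's fitted object is not reproduced here):

```r
# confint() and the by-hand interval agree exactly for an lm fit.
set.seed(7)
w   <- runif(100, 0.2, 2)
p   <- -2300 + 7900 * w + rnorm(100, sd = 1500)
fit <- lm(p ~ w)

manual <- coef(fit)["w"] +
  c(-1, 1) * qt(0.975, df = fit$df.residual) * coef(summary(fit))["w", "Std. Error"]

confint(fit, "w", level = 0.95)  # identical bounds to 'manual'
```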
I further check the numeric variables for the linearity and constant-variance assumptions by plotting residuals versus fitted values for each simple linear model.
Carat and Price Residual Fitted Plot
residuals_carat <- residuals(lm_carat)
fitted_carat <- fitted(lm_carat)
plot(fitted_carat, residuals_carat, xlab='Fitted', ylab = 'Residuals', main = 'Carat vs. Price | Residuals vs. Fitted')
This plot is very crowded on the left side and fans out to the right, which suggests non-constant variance (heteroscedasticity), a violation of the constant-variance assumption. To try to fix this, I can transform the model by taking logs.
slr_carat_adj <- lm(log(price)~log(carat))
residuals_carat_adj <- residuals(slr_carat_adj)
fitted_carat_adj <- fitted(slr_carat_adj)
plot(fitted_carat_adj, residuals_carat_adj, xlab='Fitted', ylab = 'Residuals', main = 'Carat vs. Price | Residuals vs. Fitted Adjusted')
After taking the log of both carat and price, the spread of the residuals is much more even and shows no pattern, supporting the assumptions of linearity and constant variance.
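This is the expected behavior when the noise in price is multiplicative: on the log scale it becomes additive with roughly constant spread. A synthetic sketch (the coefficients 8.4 and 1.7 are placeholders I chose, not fitted values from the report):

```r
set.seed(5)
carat <- runif(300, 0.2, 2.5)
# Multiplicative noise: the spread of price grows with its level
price <- exp(8.4 + 1.7 * log(carat)) * exp(rnorm(300, sd = 0.25))

raw <- lm(price ~ carat)            # fans out: heteroscedastic residuals
lg  <- lm(log(price) ~ log(carat))  # log scale: additive, even noise

hi <- fitted(raw) > median(fitted(raw))
c(raw_sd_high = sd(residuals(raw)[hi]),   # far larger than...
  raw_sd_low  = sd(residuals(raw)[!hi]))  # ...the low-fitted half

hi_lg <- fitted(lg) > median(fitted(lg))
c(log_sd_high = sd(residuals(lg)[hi_lg]),   # roughly equal halves
  log_sd_low  = sd(residuals(lg)[!hi_lg]))  # on the log scale
```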
Depth and Price Residual Fitted Plot
residuals_depth <- residuals(lm_depth)
fitted_depth <- fitted(lm_depth)
plot(fitted_depth, residuals_depth, xlab='Fitted', ylab = 'Residuals', main = 'Depth vs. Price | Residuals vs. Fitted')
This plot is not centered around 0 and not very spread out, which is not what we want. To try to fix this, I will again apply the log transformation.
slr_depth_adj <- lm(log(price)~log(depth))
residuals_depth_adj <- residuals(slr_depth_adj)
fitted_depth_adj <- fitted(slr_depth_adj)
plot(fitted_depth_adj, residuals_depth_adj, xlab='Fitted', ylab = 'Residuals', main = 'Depth vs. Price | Residuals vs. Fitted Adjusted')
This definitely improved the plot: the residuals now show constant variance with no clear pattern.
Table and Price Residual Fitted Plot
residuals_table <- residuals(lm_table)
fitted_table <- fitted(lm_table)
plot(fitted_table, residuals_table, xlab='Fitted', ylab = 'Residuals', main = 'Table vs. Price | Residuals vs. Fitted')
This plot shows the points in vertical lines, which is expected since the table variable takes mostly whole-number values. I can transform this model with the log as well.
slr_table_adj <- lm(log(price)~log(table))
residuals_table_adj <- residuals(slr_table_adj)
fitted_table_adj <- fitted(slr_table_adj)
plot(fitted_table_adj, residuals_table_adj, xlab='Fitted', ylab = 'Residuals', main = 'Table vs. Price | Residuals vs. Fitted Adjusted')
Now the spread of the residuals is more even and shows no pattern, supporting the assumptions of linearity and constant variance.
Length and Price Residual Fitted Plot
residuals_x <- residuals(lm_x)
fitted_x <- fitted(lm_x)
plot(fitted_x, residuals_x, xlab='Fitted', ylab = 'Residuals', main = 'Length vs. Price | Residuals vs. Fitted')
The x (length) variable has the same issue as the untransformed carat plot above.
slr_x_adj <- lm(log(price)~x)
residuals_x_adj <- residuals(slr_x_adj)
fitted_x_adj <- fitted(slr_x_adj)
plot(fitted_x_adj, residuals_x_adj, xlab='Fitted', ylab = 'Residuals', main = 'Length vs. Price | Residuals vs. Fitted Adjusted')
Once transformed, the plot looks much better: it is far more evenly spread, with no noticeable pattern, supporting the assumptions of linearity and constant variance.
Width and Price Residual Fitted Plot
residuals_y <- residuals(lm_y)
fitted_y <- fitted(lm_y)
plot(fitted_y, residuals_y, xlab='Fitted', ylab = 'Residuals', main = 'Width vs. Price | Residuals vs. Fitted')
same as above
slr_y_adj <- lm(log(price)~y)
residuals_y_adj <- residuals(slr_y_adj)
fitted_y_adj <- fitted(slr_y_adj)
plot(fitted_y_adj, residuals_y_adj, xlab='Fitted', ylab = 'Residuals', main = 'Width vs. Price | Residuals vs. Fitted Adjusted')
Same conclusion: the transformed plot supports the assumptions of linearity and constant variance.
Height and Price Residual Fitted Plot
residuals_z <- residuals(lm_z)
fitted_z <- fitted(lm_z)
plot(fitted_z, residuals_z, xlab='Fitted', ylab = 'Residuals', main = 'Height vs. Price | Residuals vs. Fitted')
same as above
slr_z_adj <- lm(log(price)~z)
residuals_z_adj <- residuals(slr_z_adj)
fitted_z_adj <- fitted(slr_z_adj)
plot(fitted_z_adj, residuals_z_adj, xlab='Fitted', ylab = 'Residuals', main = 'Height vs. Price | Residuals vs. Fitted Adjusted')
Same conclusion: the transformed plot supports the assumptions of linearity and constant variance.
Next, I check for normality to make sure the error term is normally distributed.
Carat Normality Plot
qqnorm(residuals_carat_adj)
Depth Normality Plot
qqnorm(residuals_depth_adj)
Table Normality Plot
qqnorm(residuals_table_adj)
Length Normality Plot
qqnorm(residuals_x_adj)
Width Normality Plot
qqnorm(residuals_y_adj)
Height Normality Plot
qqnorm(residuals_z_adj)
All of the normal Q-Q plots are close to linear once the response is transformed, which tells me the residuals are consistent with the normality assumption.
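As a presentation note, adding `qqline()` after each `qqnorm()` call draws a reference line through the quartiles, which makes departures from normality much easier to judge. A sketch with simulated residuals standing in for the ones above:

```r
set.seed(3)
res <- rnorm(500, sd = 0.25)   # stand-in for residuals_carat_adj

qq <- qqnorm(res, main = "Normal Q-Q Plot with Reference Line")
qqline(res, col = "red")       # points near this line = consistent with normality
```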
After testing all of the error assumptions (constant variance, linearity, and normality), all of my variables pass once transformed, so all remain candidates for my final model.
Summary of Adjusted Variables
summary(slr_carat_adj)
##
## Call:
## lm(formula = log(price) ~ log(carat))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00690 -0.16608 0.00384 0.15534 0.82045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.44370 0.01368 617.30 <2e-16 ***
## log(carat) 1.66870 0.01925 86.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2544 on 498 degrees of freedom
## Multiple R-squared: 0.9378, Adjusted R-squared: 0.9377
## F-statistic: 7512 on 1 and 498 DF, p-value: < 2.2e-16
summary(slr_depth_adj)
##
## Call:
## lm(formula = log(price) ~ log(depth))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87309 -0.91927 0.06128 0.79933 2.04127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.803 8.599 0.326 0.745
## log(depth) 1.208 2.085 0.579 0.563
##
## Residual standard error: 1.02 on 498 degrees of freedom
## Multiple R-squared: 0.0006736, Adjusted R-squared: -0.001333
## F-statistic: 0.3357 on 1 and 498 DF, p-value: 0.5626
summary(slr_table_adj)
##
## Call:
## lm(formula = log(price) ~ log(table))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.43086 -0.88148 0.01703 0.73660 2.36070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.306 4.733 -3.234 0.0013 **
## log(table) 5.703 1.169 4.879 1.44e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9966 on 498 degrees of freedom
## Multiple R-squared: 0.04562, Adjusted R-squared: 0.0437
## F-statistic: 23.8 on 1 and 498 DF, p-value: 1.439e-06
summary(slr_x_adj)
##
## Call:
## lm(formula = log(price) ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82411 -0.17176 0.01574 0.16500 0.85079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.74779 0.06276 43.78 <2e-16 ***
## x 0.87888 0.01075 81.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2686 on 498 degrees of freedom
## Multiple R-squared: 0.9307, Adjusted R-squared: 0.9305
## F-statistic: 6687 on 1 and 498 DF, p-value: < 2.2e-16
summary(slr_y_adj)
##
## Call:
## lm(formula = log(price) ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80200 -0.17789 0.01491 0.16425 0.80330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.71616 0.06194 43.85 <2e-16 ***
## y 0.88416 0.01061 83.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2638 on 498 degrees of freedom
## Multiple R-squared: 0.9331, Adjusted R-squared: 0.933
## F-statistic: 6950 on 1 and 498 DF, p-value: < 2.2e-16
summary(slr_z_adj)
##
## Call:
## lm(formula = log(price) ~ z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32813 -0.18182 0.00425 0.17763 0.88272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7931 0.0661 42.26 <2e-16 ***
## z 1.4084 0.0183 76.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2841 on 498 degrees of freedom
## Multiple R-squared: 0.9225, Adjusted R-squared: 0.9223
## F-statistic: 5924 on 1 and 498 DF, p-value: < 2.2e-16
Here we can see how the significance and \(R^2adj\) values have changed. Depth, even though it now satisfies the error assumptions, still does not have a significant effect on price, so I will rule it out. Table's significance has grown and its \(R^2adj\) increased, which is good for my model.
Linear model for price and cut (adjusting for carat):
lm_cut = lm(log(price) ~ log(carat) + cut)
summary(lm_cut)
##
## Call:
## lm(formula = log(price) ~ log(carat) + cut)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72340 -0.16972 0.00027 0.15934 0.76215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.14488 0.06188 131.634 < 2e-16 ***
## log(carat) 1.69512 0.01963 86.343 < 2e-16 ***
## cutGood 0.29396 0.07285 4.035 6.32e-05 ***
## cutIdeal 0.35485 0.06532 5.432 8.74e-08 ***
## cutPremium 0.27820 0.06586 4.224 2.86e-05 ***
## cutVery Good 0.31075 0.06656 4.668 3.92e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2475 on 494 degrees of freedom
## Multiple R-squared: 0.9416, Adjusted R-squared: 0.941
## F-statistic: 1593 on 5 and 494 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{log(price)} = 8.14488 + 1.69512\,log(carat) + 0.29396\,cutGood + 0.35485\,cutIdeal + 0.27820\,cutPremium + 0.31075\,cutVeryGood\)
Each cut level has a p-value less than 0.05, so \(H_0\) is rejected, and we conclude that cut has a significant relationship with price.
Linear model for price and color (adjusting for carat):
lm_color = lm(log(price) ~ log(carat) + color)
summary(lm_color)
##
## Call:
## lm(formula = log(price) ~ log(carat) + color)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0590 -0.1305 -0.0025 0.1373 0.6669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.59395 0.03319 258.930 < 2e-16 ***
## log(carat) 1.70755 0.01812 94.242 < 2e-16 ***
## colorE -0.08220 0.04028 -2.041 0.0418 *
## colorF -0.09463 0.03952 -2.395 0.0170 *
## colorG -0.07312 0.03908 -1.871 0.0619 .
## colorH -0.22933 0.04042 -5.674 2.39e-08 ***
## colorI -0.26524 0.04842 -5.478 6.87e-08 ***
## colorJ -0.45093 0.05590 -8.067 5.55e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2325 on 492 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.948
## F-statistic: 1299 on 7 and 492 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{log(price)} = 8.59395 + 1.70755\,log(carat) - 0.08220\,colorE - 0.09463\,colorF - 0.07312\,colorG - 0.22933\,colorH - 0.26524\,colorI - 0.45093\,colorJ\)
Most color levels have p-values less than 0.05, so \(H_0\) is rejected, and we conclude that color has a significant relationship with price.
Linear model for price and clarity (adjusting for carat):
lm_clarity = lm(log(price) ~ log(carat) + clarity)
summary(lm_clarity)
##
## Call:
## lm(formula = log(price) ~ log(carat) + clarity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56731 -0.09920 0.00946 0.12845 0.54031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.77622 0.05846 133.02 < 2e-16 ***
## log(carat) 1.80368 0.01545 116.76 < 2e-16 ***
## clarityIF 1.04194 0.07097 14.68 < 2e-16 ***
## claritySI1 0.63001 0.06099 10.33 < 2e-16 ***
## claritySI2 0.48428 0.06209 7.80 3.76e-14 ***
## clarityVS1 0.82233 0.06293 13.07 < 2e-16 ***
## clarityVS2 0.75568 0.06204 12.18 < 2e-16 ***
## clarityVVS1 1.01884 0.06929 14.70 < 2e-16 ***
## clarityVVS2 0.93599 0.06636 14.10 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1846 on 491 degrees of freedom
## Multiple R-squared: 0.9677, Adjusted R-squared: 0.9672
## F-statistic: 1840 on 8 and 491 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{log(price)} = 7.77622 + 1.80368\,log(carat) + 1.04194\,clarityIF + 0.63001\,claritySI1 + 0.48428\,claritySI2 + 0.82233\,clarityVS1 + 0.75568\,clarityVS2 + 1.01884\,clarityVVS1 + 0.93599\,clarityVVS2\)
All clarity levels have p-values less than 0.05, so \(H_0\) is rejected, and we conclude that clarity has a significant relationship with price.
Overall, I would say color and clarity are the best categorical variables, as they have the smallest p-values and the highest \(R^2adj\) values; however, cut is also a useful variable for the model.
The \(R^2adj\) value for the transformed carat-price model is 0.9377. I will now test how this value changes as I add more variables to the model. I already saw the effect of adding the categorical variables: \(R^2adj\) went up to 0.941 with cut, 0.948 with color, and 0.967 with clarity.
Among the numeric variables, adding just table increased \(R^2adj\) to 0.9382, adding just x increased it to 0.9382, just y increased it to 0.9391, and just z left it at 0.9377. Surprisingly, adding depth increased the \(R^2adj\) value the most.
Adding depth and table together to the log(price) ~ log(carat) model increased \(R^2adj\) to 0.9405. Adding x, y, or z on top of that model did not increase \(R^2adj\) further, which again points to multicollinearity between the dimension variables and carat, so they were omitted.
Next, I added the categorical variables to the log(price) ~ log(carat) + depth + table model: cut increased \(R^2adj\) to 0.9416, clarity increased it to 0.9684, and including color pushed it even higher, to 0.985.
However, when running the summary I noticed that table and depth were not significant in my model, so I removed them and instead added z. This gave a model with an \(R^2adj\) of 0.9853.
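The comparisons above can be automated by looping over candidate formulas and extracting `summary()$adj.r.squared` from each fit. A sketch on synthetic data (the formulas and variable relationships here are illustrative, not the report's actual values):

```r
# Tabulate adjusted R^2 across a family of candidate models.
set.seed(9)
d <- data.frame(carat = runif(200, 0.2, 2))
d$z     <- 3.5 * d$carat^(1/3) + rnorm(200, sd = 0.05)  # collinear with carat
d$depth <- rnorm(200, 61.8, 1.4)                        # unrelated noise
d$price <- exp(8.4 + 1.7 * log(d$carat) + rnorm(200, sd = 0.25))

forms <- list(base       = log(price) ~ log(carat),
              with_depth = log(price) ~ log(carat) + depth,
              with_z     = log(price) ~ log(carat) + z)

sapply(forms, function(f) summary(lm(f, data = d))$adj.r.squared)
```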
After completing the tests above, I have decided this is the best model:
\[ log(price) = \beta_0 + \beta_1log(carat) + \beta_2color + \beta_3clarity + \beta_4cut + \beta_5z + \epsilon \]
lm2 = lm(log(price) ~ log(carat) + z + cut + clarity + color)
summary(lm2)
##
## Call:
## lm(formula = log(price) ~ log(carat) + z + cut + clarity + color)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37790 -0.08161 -0.00223 0.08039 0.34837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.79092 0.30134 22.536 < 2e-16 ***
## log(carat) 1.59796 0.08492 18.817 < 2e-16 ***
## z 0.24569 0.07248 3.390 0.000757 ***
## cutGood 0.13077 0.03837 3.408 0.000710 ***
## cutIdeal 0.20503 0.03462 5.922 6.05e-09 ***
## cutPremium 0.16490 0.03520 4.685 3.66e-06 ***
## cutVery Good 0.17011 0.03501 4.859 1.60e-06 ***
## clarityIF 1.10852 0.04963 22.335 < 2e-16 ***
## claritySI1 0.65966 0.04291 15.374 < 2e-16 ***
## claritySI2 0.48112 0.04361 11.033 < 2e-16 ***
## clarityVS1 0.87187 0.04421 19.720 < 2e-16 ***
## clarityVS2 0.80088 0.04375 18.308 < 2e-16 ***
## clarityVVS1 1.08047 0.04845 22.302 < 2e-16 ***
## clarityVVS2 0.96875 0.04642 20.867 < 2e-16 ***
## colorE -0.06388 0.02166 -2.950 0.003336 **
## colorF -0.11449 0.02133 -5.368 1.24e-07 ***
## colorG -0.14506 0.02129 -6.815 2.84e-11 ***
## colorH -0.27813 0.02198 -12.655 < 2e-16 ***
## colorI -0.39323 0.02632 -14.941 < 2e-16 ***
## colorJ -0.55634 0.03025 -18.389 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1234 on 480 degrees of freedom
## Multiple R-squared: 0.9859, Adjusted R-squared: 0.9853
## F-statistic: 1766 on 19 and 480 DF, p-value: < 2.2e-16
Transforming the response by taking the log of price, along with the log of carat, helped increase the \(R^2_{adj}\) value, so I will continue to use this transformation. I also noticed that including table and depth did not help my model: their lack of correlation with price led to overfitting. Likewise, x and y were not helpful because their strong correlation with carat caused multicollinearity.
Now I will create a model using criterion-based methods to select the variables. I will start by using the Akaike Information Criterion (AIC) to tell me which predictors to select. Below I have generated models using three different search methods: backward elimination, forward selection, and stepwise selection.
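Note that for linear models, `step()` scores candidates with an RSS-based AIC, \(n\log(RSS/n) + 2p\) (up to an additive constant), where \(p\) is the number of estimated coefficients. As a sketch, this can be computed by hand; for the final backward-elimination model below (RSS ≈ 7.397, n = 500, p = 22) it gives ≈ −2062.8, matching the trace:

```r
# RSS-based AIC as used by step()/extractAIC() for lm fits:
# n * log(RSS / n) + 2 * p, with p = number of estimated coefficients
step_aic <- function(model) {
  res <- residuals(model)
  rss <- sum(res^2)
  n   <- length(res)
  p   <- length(coef(model))
  n * log(rss / n) + 2 * p
}
```

This matches the second element of `extractAIC(model)` but not `AIC(model)`, which uses the full Gaussian log-likelihood; only differences within one convention are meaningful.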
AIC Backward Elimination
full_model <- lm(log(price)~., DiamondData)
step(full_model, direction='backward')
## Start: AIC=-2059.6
## log(price) ~ carat + cut + color + clarity + depth + table +
## x + y + z
##
## Df Sum of Sq RSS AIC
## - y 1 0.0078 7.3924 -2061.1
## - depth 1 0.0121 7.3968 -2060.8
## <none> 7.3846 -2059.6
## - z 1 0.0623 7.4469 -2057.4
## - table 1 0.0955 7.4801 -2055.2
## - cut 4 0.3347 7.7193 -2045.4
## - x 1 0.2827 7.6673 -2042.8
## - carat 1 3.2543 10.6390 -1879.0
## - color 6 9.1766 16.5613 -1667.8
## - clarity 7 17.0052 24.3898 -1476.2
##
## Step: AIC=-2061.08
## log(price) ~ carat + cut + color + clarity + depth + table +
## x + z
##
## Df Sum of Sq RSS AIC
## - depth 1 0.0047 7.3972 -2062.8
## <none> 7.3924 -2061.1
## - table 1 0.0951 7.4875 -2056.7
## - z 1 0.1730 7.5654 -2051.5
## - cut 4 0.3908 7.7832 -2043.3
## - x 1 0.2987 7.6911 -2043.3
## - carat 1 3.2532 10.6457 -1880.7
## - color 6 9.1774 16.5699 -1669.5
## - clarity 7 17.2383 24.6308 -1473.3
##
## Step: AIC=-2062.76
## log(price) ~ carat + cut + color + clarity + table + x + z
##
## Df Sum of Sq RSS AIC
## <none> 7.3972 -2062.8
## - table 1 0.0904 7.4876 -2058.7
## - cut 4 0.4016 7.7987 -2044.3
## - x 1 2.3920 9.7891 -1924.7
## - z 1 2.5299 9.9270 -1917.7
## - carat 1 3.2959 10.6931 -1880.5
## - color 6 9.1986 16.5957 -1670.7
## - clarity 7 17.7297 25.1269 -1465.3
##
## Call:
## lm(formula = log(price) ~ carat + cut + color + clarity + table +
## x + z, data = DiamondData)
##
## Coefficients:
## (Intercept) carat cutGood cutIdeal cutPremium
## -0.65205 -0.96135 0.12432 0.18682 0.13700
## cutVery Good colorE colorF colorG colorH
## 0.16381 -0.05103 -0.11215 -0.14390 -0.27686
## colorI colorJ clarityIF claritySI1 claritySI2
## -0.40480 -0.55660 1.17546 0.72515 0.54168
## clarityVS1 clarityVS2 clarityVVS1 clarityVVS2 table
## 0.93156 0.86275 1.13982 1.03087 0.00835
## x z
## 0.68759 1.12393
backElim_model <- lm(log(price) ~ carat + cut + color + clarity + table + x + z, data = DiamondData)
summary(backElim_model)
##
## Call:
## lm(formula = log(price) ~ carat + cut + color + clarity + table +
## x + z, data = DiamondData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37845 -0.08277 -0.00156 0.08560 0.37921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.652054 0.237467 -2.746 0.006263 **
## carat -0.961345 0.065873 -14.594 < 2e-16 ***
## cutGood 0.124323 0.039780 3.125 0.001885 **
## cutIdeal 0.186822 0.040829 4.576 6.06e-06 ***
## cutPremium 0.137005 0.039753 3.446 0.000618 ***
## cutVery Good 0.163807 0.038172 4.291 2.15e-05 ***
## colorE -0.051031 0.021895 -2.331 0.020183 *
## colorF -0.112154 0.021535 -5.208 2.84e-07 ***
## colorG -0.143904 0.021591 -6.665 7.31e-11 ***
## colorH -0.276855 0.022307 -12.411 < 2e-16 ***
## colorI -0.404797 0.026539 -15.253 < 2e-16 ***
## colorJ -0.556602 0.030584 -18.199 < 2e-16 ***
## clarityIF 1.175455 0.050096 23.464 < 2e-16 ***
## claritySI1 0.725152 0.043135 16.811 < 2e-16 ***
## claritySI2 0.541680 0.043956 12.323 < 2e-16 ***
## clarityVS1 0.931561 0.044555 20.908 < 2e-16 ***
## clarityVS2 0.862750 0.044072 19.576 < 2e-16 ***
## clarityVVS1 1.139818 0.048904 23.307 < 2e-16 ***
## clarityVVS2 1.030875 0.046863 21.997 < 2e-16 ***
## table 0.008350 0.003455 2.417 0.016024 *
## x 0.687591 0.055306 12.433 < 2e-16 ***
## z 1.123926 0.087903 12.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1244 on 478 degrees of freedom
## Multiple R-squared: 0.9857, Adjusted R-squared: 0.9851
## F-statistic: 1572 on 21 and 478 DF, p-value: < 2.2e-16
I can see that backward elimination removed the predictors depth and y; the model with the lowest AIC has an \(R^2_{adj}\) of 0.9851 and AIC = -2062.76. The model backward elimination suggests is:
\[ log(price) = \beta_0 + \beta_1carat + \beta_2cut + \beta_3color + \beta_4clarity + \beta_5table + \beta_6x + \beta_7z + \epsilon \]
AIC Forward Selection
null_model <- lm(log(price)~1, DiamondData)
step(null_model, direction='forward', scope = formula(full_model))
## Start: AIC=19.96
## log(price) ~ 1
##
## Df Sum of Sq RSS AIC
## + y 1 483.63 34.66 -1330.57
## + x 1 482.36 35.92 -1312.62
## + z 1 478.09 40.19 -1256.47
## + carat 1 451.09 67.19 -999.57
## + cut 4 31.36 486.93 -3.25
## + clarity 7 37.02 481.26 -3.10
## + table 1 22.95 495.33 -0.69
## <none> 518.28 19.96
## + color 6 11.54 506.74 20.69
## + depth 1 0.41 517.87 21.56
##
## Step: AIC=-1330.57
## log(price) ~ y
##
## Df Sum of Sq RSS AIC
## + clarity 7 11.7853 22.871 -1524.4
## + color 6 6.8256 27.830 -1428.2
## + carat 1 2.8617 31.794 -1371.7
## + depth 1 0.1467 34.509 -1330.7
## + z 1 0.1388 34.517 -1330.6
## <none> 34.656 -1330.6
## + table 1 0.0690 34.587 -1329.6
## + x 1 0.0307 34.625 -1329.0
## + cut 4 0.4300 34.226 -1328.8
##
## Step: AIC=-1524.38
## log(price) ~ y + clarity
##
## Df Sum of Sq RSS AIC
## + color 6 9.7834 13.087 -1791.5
## + carat 1 3.1211 19.749 -1595.7
## + z 1 1.0323 21.838 -1545.5
## + depth 1 0.8721 21.998 -1541.8
## + x 1 0.3296 22.541 -1529.6
## + cut 4 0.3917 22.479 -1525.0
## <none> 22.870 -1524.4
## + table 1 0.0003 22.870 -1522.4
##
## Step: AIC=-1791.49
## log(price) ~ y + clarity + color
##
## Df Sum of Sq RSS AIC
## + z 1 1.81523 11.272 -1864.2
## + carat 1 1.77949 11.308 -1862.6
## + depth 1 1.56548 11.522 -1853.2
## + x 1 0.64313 12.444 -1814.7
## <none> 13.087 -1791.5
## + table 1 0.01984 13.067 -1790.2
## + cut 4 0.14991 12.937 -1789.2
##
## Step: AIC=-1864.15
## log(price) ~ y + clarity + color + z
##
## Df Sum of Sq RSS AIC
## + carat 1 3.2282 8.0436 -2030.9
## + x 1 0.2556 11.0162 -1873.6
## + depth 1 0.0507 11.2212 -1864.4
## + cut 4 0.1839 11.0880 -1864.4
## <none> 11.2719 -1864.2
## + table 1 0.0305 11.2414 -1863.5
##
## Step: AIC=-2030.86
## log(price) ~ y + clarity + color + z + carat
##
## Df Sum of Sq RSS AIC
## + x 1 0.316138 7.7275 -2048.9
## + depth 1 0.106402 7.9372 -2035.5
## + cut 4 0.197591 7.8461 -2035.3
## <none> 8.0436 -2030.9
## + table 1 0.005318 8.0383 -2029.2
##
## Step: AIC=-2048.91
## log(price) ~ y + clarity + color + z + carat + x
##
## Df Sum of Sq RSS AIC
## + cut 4 0.243656 7.4838 -2056.9
## <none> 7.7275 -2048.9
## + depth 1 0.007798 7.7197 -2047.4
## + table 1 0.000089 7.7274 -2046.9
##
## Step: AIC=-2056.93
## log(price) ~ y + clarity + color + z + carat + x + cut
##
## Df Sum of Sq RSS AIC
## + table 1 0.087090 7.3968 -2060.8
## <none> 7.4838 -2056.9
## + depth 1 0.003759 7.4801 -2055.2
##
## Step: AIC=-2060.78
## log(price) ~ y + clarity + color + z + carat + x + cut + table
##
## Df Sum of Sq RSS AIC
## <none> 7.3968 -2060.8
## + depth 1 0.012125 7.3846 -2059.6
##
## Call:
## lm(formula = log(price) ~ y + clarity + color + z + carat + x +
## cut + table, data = DiamondData)
##
## Coefficients:
## (Intercept) y clarityIF claritySI1 claritySI2
## -0.646368 0.021823 1.174667 0.724791 0.541568
## clarityVS1 clarityVS2 clarityVVS1 clarityVVS2 colorE
## 0.930980 0.862124 1.139110 1.030157 -0.050855
## colorF colorG colorH colorI colorJ
## -0.112033 -0.143824 -0.276638 -0.404476 -0.556470
## z carat x cutGood cutIdeal
## 1.122292 -0.961953 0.666978 0.122646 0.184874
## cutPremium cutVery Good table
## 0.136072 0.161631 0.008273
forward_model <- lm(formula = log(price) ~ y + clarity + color + z + carat +
x + cut + table, data = DiamondData)
summary(forward_model)
##
## Call:
## lm(formula = log(price) ~ y + clarity + color + z + carat + x +
## cut + table, data = DiamondData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37802 -0.08234 -0.00160 0.08539 0.37902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.646368 0.240365 -2.689 0.007415 **
## y 0.021823 0.136741 0.160 0.873271
## clarityIF 1.174667 0.050390 23.312 < 2e-16 ***
## claritySI1 0.724791 0.043238 16.763 < 2e-16 ***
## claritySI2 0.541568 0.044007 12.306 < 2e-16 ***
## clarityVS1 0.930980 0.044749 20.805 < 2e-16 ***
## clarityVS2 0.862124 0.044291 19.465 < 2e-16 ***
## clarityVVS1 1.139110 0.049155 23.174 < 2e-16 ***
## clarityVVS2 1.030157 0.047127 21.859 < 2e-16 ***
## colorE -0.050855 0.021945 -2.317 0.020906 *
## colorF -0.112033 0.021570 -5.194 3.06e-07 ***
## colorG -0.143824 0.021619 -6.653 7.90e-11 ***
## colorH -0.276638 0.022372 -12.366 < 2e-16 ***
## colorI -0.404476 0.026642 -15.182 < 2e-16 ***
## colorJ -0.556470 0.030626 -18.170 < 2e-16 ***
## z 1.122292 0.088587 12.669 < 2e-16 ***
## carat -0.961953 0.066050 -14.564 < 2e-16 ***
## x 0.666978 0.140528 4.746 2.74e-06 ***
## cutGood 0.122646 0.041184 2.978 0.003049 **
## cutIdeal 0.184874 0.042654 4.334 1.78e-05 ***
## cutPremium 0.136072 0.040220 3.383 0.000776 ***
## cutVery Good 0.161631 0.040571 3.984 7.84e-05 ***
## table 0.008273 0.003491 2.370 0.018192 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1245 on 477 degrees of freedom
## Multiple R-squared: 0.9857, Adjusted R-squared: 0.9851
## F-statistic: 1498 on 22 and 477 DF, p-value: < 2.2e-16
Forward selection produced a slightly different model, which retained y but did not include depth. The \(R^2_{adj}\) of this model is also 0.9851, with a final AIC = -2060.78, and it looks like this:
\[ log(price) = \beta_0 + \beta_1y + \beta_2clarity + \beta_3color + \beta_4z + \beta_5carat + \beta_6x + \beta_7cut + \beta_8table + \epsilon \]
AIC Stepwise Regression
null_model <- lm(log(price)~1, DiamondData)
step(null_model, direction='both', scope = formula(full_model))
## Start: AIC=19.96
## log(price) ~ 1
##
## Df Sum of Sq RSS AIC
## + y 1 483.63 34.66 -1330.57
## + x 1 482.36 35.92 -1312.62
## + z 1 478.09 40.19 -1256.47
## + carat 1 451.09 67.19 -999.57
## + cut 4 31.36 486.93 -3.25
## + clarity 7 37.02 481.26 -3.10
## + table 1 22.95 495.33 -0.69
## <none> 518.28 19.96
## + color 6 11.54 506.74 20.69
## + depth 1 0.41 517.87 21.56
##
## Step: AIC=-1330.57
## log(price) ~ y
##
## Df Sum of Sq RSS AIC
## + clarity 7 11.79 22.87 -1524.38
## + color 6 6.83 27.83 -1428.24
## + carat 1 2.86 31.79 -1371.66
## + depth 1 0.15 34.51 -1330.69
## + z 1 0.14 34.52 -1330.58
## <none> 34.66 -1330.57
## + table 1 0.07 34.59 -1329.57
## + x 1 0.03 34.63 -1329.01
## + cut 4 0.43 34.23 -1328.81
## - y 1 483.63 518.28 19.96
##
## Step: AIC=-1524.38
## log(price) ~ y + clarity
##
## Df Sum of Sq RSS AIC
## + color 6 9.78 13.09 -1791.5
## + carat 1 3.12 19.75 -1595.7
## + z 1 1.03 21.84 -1545.5
## + depth 1 0.87 22.00 -1541.8
## + x 1 0.33 22.54 -1529.6
## + cut 4 0.39 22.48 -1525.0
## <none> 22.87 -1524.4
## + table 1 0.00 22.87 -1522.4
## - clarity 7 11.79 34.66 -1330.6
## - y 1 458.39 481.26 -3.1
##
## Step: AIC=-1791.49
## log(price) ~ y + clarity + color
##
## Df Sum of Sq RSS AIC
## + z 1 1.82 11.27 -1864.15
## + carat 1 1.78 11.31 -1862.57
## + depth 1 1.57 11.52 -1853.19
## + x 1 0.64 12.44 -1814.69
## <none> 13.09 -1791.49
## + table 1 0.02 13.07 -1790.25
## + cut 4 0.15 12.94 -1789.25
## - color 6 9.78 22.87 -1524.38
## - clarity 7 14.74 27.83 -1428.24
## - y 1 455.40 468.49 -4.55
##
## Step: AIC=-1864.15
## log(price) ~ y + clarity + color + z
##
## Df Sum of Sq RSS AIC
## + carat 1 3.2282 8.0436 -2030.9
## + x 1 0.2556 11.0162 -1873.6
## + depth 1 0.0507 11.2212 -1864.4
## + cut 4 0.1839 11.0880 -1864.4
## <none> 11.2719 -1864.2
## + table 1 0.0305 11.2414 -1863.5
## - z 1 1.8152 13.0871 -1791.5
## - y 1 2.7573 14.0291 -1756.7
## - color 6 10.5663 21.8382 -1545.5
## - clarity 7 16.2254 27.4973 -1432.3
##
## Step: AIC=-2030.86
## log(price) ~ y + clarity + color + z + carat
##
## Df Sum of Sq RSS AIC
## + x 1 0.3161 7.7275 -2048.9
## + depth 1 0.1064 7.9372 -2035.5
## + cut 4 0.1976 7.8461 -2035.3
## <none> 8.0436 -2030.9
## + table 1 0.0053 8.0383 -2029.2
## - carat 1 3.2282 11.2719 -1864.2
## - z 1 3.2640 11.3076 -1862.6
## - y 1 4.7021 12.7457 -1802.7
## - color 6 9.1056 17.1493 -1664.3
## - clarity 7 17.1175 25.1611 -1474.7
##
## Step: AIC=-2048.91
## log(price) ~ y + clarity + color + z + carat + x
##
## Df Sum of Sq RSS AIC
## + cut 4 0.2437 7.4838 -2056.9
## <none> 7.7275 -2048.9
## + depth 1 0.0078 7.7197 -2047.4
## + table 1 0.0001 7.7274 -2046.9
## - y 1 0.0713 7.7988 -2046.3
## - x 1 0.3161 8.0436 -2030.9
## - z 1 2.7089 10.4365 -1900.7
## - carat 1 3.2887 11.0162 -1873.6
## - color 6 9.2518 16.9793 -1667.3
## - clarity 7 17.3953 25.1228 -1473.4
##
## Step: AIC=-2056.93
## log(price) ~ y + clarity + color + z + carat + x + cut
##
## Df Sum of Sq RSS AIC
## + table 1 0.0871 7.3968 -2060.8
## - y 1 0.0037 7.4876 -2058.7
## <none> 7.4838 -2056.9
## + depth 1 0.0038 7.4801 -2055.2
## - cut 4 0.2437 7.7275 -2048.9
## - x 1 0.3622 7.8461 -2035.3
## - z 1 2.5361 10.0199 -1913.0
## - carat 1 3.3531 10.8370 -1873.8
## - color 6 9.1841 16.6679 -1668.6
## - clarity 7 16.9391 24.4229 -1479.5
##
## Step: AIC=-2060.78
## log(price) ~ y + clarity + color + z + carat + x + cut + table
##
## Df Sum of Sq RSS AIC
## - y 1 0.0004 7.3972 -2062.8
## <none> 7.3968 -2060.8
## + depth 1 0.0121 7.3846 -2059.6
## - table 1 0.0871 7.4838 -2056.9
## - cut 4 0.3307 7.7274 -2046.9
## - x 1 0.3493 7.7461 -2039.7
## - z 1 2.4888 9.8856 -1917.8
## - carat 1 3.2891 10.6859 -1878.8
## - color 6 9.1678 16.5646 -1669.7
## - clarity 7 17.0262 24.4229 -1477.5
##
## Step: AIC=-2062.76
## log(price) ~ clarity + color + z + carat + x + cut + table
##
## Df Sum of Sq RSS AIC
## <none> 7.3972 -2062.8
## + depth 1 0.0047 7.3924 -2061.1
## + y 1 0.0004 7.3968 -2060.8
## - table 1 0.0904 7.4876 -2058.7
## - cut 4 0.4016 7.7987 -2044.3
## - x 1 2.3920 9.7891 -1924.7
## - z 1 2.5299 9.9270 -1917.7
## - carat 1 3.2959 10.6931 -1880.5
## - color 6 9.1986 16.5957 -1670.7
## - clarity 7 17.7297 25.1269 -1465.3
##
## Call:
## lm(formula = log(price) ~ clarity + color + z + carat + x + cut +
## table, data = DiamondData)
##
## Coefficients:
## (Intercept) clarityIF claritySI1 claritySI2 clarityVS1
## -0.65205 1.17546 0.72515 0.54168 0.93156
## clarityVS2 clarityVVS1 clarityVVS2 colorE colorF
## 0.86275 1.13982 1.03087 -0.05103 -0.11215
## colorG colorH colorI colorJ z
## -0.14390 -0.27686 -0.40480 -0.55660 1.12393
## carat x cutGood cutIdeal cutPremium
## -0.96135 0.68759 0.12432 0.18682 0.13700
## cutVery Good table
## 0.16381 0.00835
stepWise_model <- lm(formula = log(price) ~ clarity + color + z + carat + x +
cut + table, data = DiamondData)
summary(stepWise_model)
##
## Call:
## lm(formula = log(price) ~ clarity + color + z + carat + x + cut +
## table, data = DiamondData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37845 -0.08277 -0.00156 0.08560 0.37921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.652054 0.237467 -2.746 0.006263 **
## clarityIF 1.175455 0.050096 23.464 < 2e-16 ***
## claritySI1 0.725152 0.043135 16.811 < 2e-16 ***
## claritySI2 0.541680 0.043956 12.323 < 2e-16 ***
## clarityVS1 0.931561 0.044555 20.908 < 2e-16 ***
## clarityVS2 0.862750 0.044072 19.576 < 2e-16 ***
## clarityVVS1 1.139818 0.048904 23.307 < 2e-16 ***
## clarityVVS2 1.030875 0.046863 21.997 < 2e-16 ***
## colorE -0.051031 0.021895 -2.331 0.020183 *
## colorF -0.112154 0.021535 -5.208 2.84e-07 ***
## colorG -0.143904 0.021591 -6.665 7.31e-11 ***
## colorH -0.276855 0.022307 -12.411 < 2e-16 ***
## colorI -0.404797 0.026539 -15.253 < 2e-16 ***
## colorJ -0.556602 0.030584 -18.199 < 2e-16 ***
## z 1.123926 0.087903 12.786 < 2e-16 ***
## carat -0.961345 0.065873 -14.594 < 2e-16 ***
## x 0.687591 0.055306 12.433 < 2e-16 ***
## cutGood 0.124323 0.039780 3.125 0.001885 **
## cutIdeal 0.186822 0.040829 4.576 6.06e-06 ***
## cutPremium 0.137005 0.039753 3.446 0.000618 ***
## cutVery Good 0.163807 0.038172 4.291 2.15e-05 ***
## table 0.008350 0.003455 2.417 0.016024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1244 on 478 degrees of freedom
## Multiple R-squared: 0.9857, Adjusted R-squared: 0.9851
## F-statistic: 1572 on 21 and 478 DF, p-value: < 2.2e-16
Stepwise regression also removed depth and y from the model. This model likewise has an \(R^2_{adj}\) value of 0.9851 and AIC = -2062.76. The suggested model looks like this:
\[ log(price) = \beta_0 + \beta_1clarity + \beta_2color + \beta_3z + \beta_4carat + \beta_5x + \beta_6cut + \beta_7table + \epsilon \]
All three models created using AIC had the same \(R^2_{adj}\) value of 0.9851; however, backward elimination and stepwise regression produced the same model, with one less predictor, y. That model is not far off from the one I created, except that it includes table and does not take the log of carat, and the model I created had a slightly higher \(R^2_{adj}\) value. To make a decision I will check the AIC and BIC values of the three candidates: my model, the forward-selection model, and the stepwise/backward model.
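A caveat when reading the numbers below: `AIC()` and `BIC()` are computed from the full Gaussian log-likelihood, so their values sit on a different scale than the RSS-based AIC that `step()` printed above; only comparisons within one convention are meaningful. BIC replaces the \(2p\) penalty with \(\log(n)\,p\), penalizing extra predictors more heavily. A sketch of both:

```r
# Likelihood-based information criteria, as computed by AIC()/BIC():
# -2*logLik + penalty, where df counts the coefficients plus sigma
ic <- function(model) {
  ll <- logLik(model)
  df <- attr(ll, "df")                 # coefficients + 1 for sigma
  n  <- length(residuals(model))
  c(AIC = -2 * as.numeric(ll) + 2 * df,
    BIC = -2 * as.numeric(ll) + log(n) * df)
}
```

Because BIC's log(n) penalty exceeds 2 once n > 7, BIC tends to favor smaller models than AIC on a sample of 500 diamonds.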
AIC(forward_model)
## [1] -639.8445
AIC(stepWise_model)
## [1] -641.8178
AIC(lm2)
## [1] -651.7323
BIC(forward_model)
## [1] -538.6939
BIC(stepWise_model)
## [1] -544.8818
BIC(lm2)
## [1] -563.2255
Out of the three models, the one I created, in which I removed table, depth, x, and y and took the log of carat, has the lowest AIC and BIC values, which is ideal for a model. And, as stated before, it had the highest \(R^2_{adj}\).
Next I will look at the confidence intervals for each of my predictors with the formula:
\[\hat{\beta}_j-t(\alpha / 2, n-p) \operatorname{se}\left(\hat{\beta}_j\right) \leq \beta_j \leq \hat{\beta}_j+t(\alpha / 2, n-p) \operatorname{se}\left(\hat{\beta}_j\right)\]
where \(p\) is the number of estimated coefficients (here \(n - p = 480\) residual degrees of freedom).
confint(lm2)
## 2.5 % 97.5 %
## (Intercept) 6.19881666 7.38301882
## log(carat) 1.43109278 1.76482495
## z 0.10327460 0.38810932
## cutGood 0.05537115 0.20616803
## cutIdeal 0.13700501 0.27305171
## cutPremium 0.09573828 0.23406985
## cutVery Good 0.10132021 0.23890869
## clarityIF 1.01100100 1.20604196
## claritySI1 0.57535297 0.74397284
## claritySI2 0.39543555 0.56679837
## clarityVS1 0.78499774 0.95874552
## clarityVS2 0.71492530 0.88684046
## clarityVVS1 0.98527364 1.17566209
## clarityVVS2 0.87752534 1.05996743
## colorE -0.10642689 -0.02132536
## colorF -0.15639655 -0.07257967
## colorG -0.18689257 -0.10323727
## colorH -0.32131684 -0.23494410
## colorI -0.44494772 -0.34151972
## colorJ -0.61578441 -0.49688911
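As a check, any row of the `confint()` output can be reproduced from the summary table using the formula above; for example, for log(carat) (estimate 1.59796, standard error 0.08492, 480 residual degrees of freedom):

```r
# 95% CI for the log(carat) coefficient, computed by hand from the
# estimate and standard error in the lm2 summary above
est <- 1.59796
se  <- 0.08492
df  <- 480
est + c(-1, 1) * qt(0.975, df) * se   # close to (1.43109, 1.76482) from confint(lm2)
```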
head(exp(predict(lm2, interval = "confidence", level = .95)))
## fit lwr upr
## 1 2543.8621 2420.2871 2673.7466
## 2 3026.5532 2914.9362 3142.4442
## 3 772.6731 738.3613 808.5794
## 4 729.6803 691.7040 769.7416
## 5 922.9210 873.1527 975.5261
## 6 3449.5042 3290.9238 3615.7262
head(exp(predict(lm2, interval = "prediction", level = .95)))
## Warning in predict.lm(lm2, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
## fit lwr upr
## 1 2543.8621 1986.0277 3258.3809
## 2 3026.5532 2367.9951 3868.2615
## 3 772.6731 603.7450 988.8675
## 4 729.6803 569.2384 935.3433
## 5 922.9210 719.6764 1183.5642
## 6 3449.5042 2694.5177 4416.0332
After calculating the confidence interval and the prediction interval, I can see their ranges differ a lot: the prediction interval is much wider. This is because the prediction interval depends on both the error in the fitted model and the error associated with a future observation.
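This difference can be made explicit. For a new observation, the confidence interval for the mean response uses only the standard error of the fitted value \(\hat{y}_0\), while the prediction interval adds the residual variance \(\hat{\sigma}^2\) under the square root:
\[ \hat{y}_0 \pm t(\alpha/2,\, n-p)\,\operatorname{se}\left(\hat{y}_0\right) \quad \text{versus} \quad \hat{y}_0 \pm t(\alpha/2,\, n-p)\sqrt{\operatorname{se}\left(\hat{y}_0\right)^2 + \hat{\sigma}^2} \]
Since \(\hat{\sigma}^2\) does not shrink as the sample grows, the prediction interval stays wide even when the mean response is estimated precisely.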
\[ log(price) = \beta_0 + \beta_1log(carat) + \beta_2color + \beta_3clarity + \beta_4cut + \beta_5z + \epsilon \]
Throughout this analysis, I investigated the effect of 9 variables on one response variable. From the beginning I was able to rule out two of the three dimension variables x, y, and z due to multicollinearity, which makes sense given the round shape of a diamond: when one dimension grows, so do the others. I was also able to rule out depth from my model because it was not significant and had little correlation with price. Later in my analysis I noticed I needed to transform some of my variables, like price and carat, to satisfy the error assumption of constant variance. After putting my model together I noticed that table was hurting my model: it was not as significant as the other variables and was causing overfitting. Removing table improved the fit, and I arrived at a model in which 98.53% of the variance in the response can be explained by the model.